class: left, middle, inverse, hide-logo

### Boosting methods for regression <!-- English title page -->

#### Emanuel Sommer

#### 2021-06-22

<center>
<br><br><br>
<img src="img/maTUM_w.png" width="100">
<br>
<br>
Department of Mathematics
<br>
Technical University of Munich
</center>

---

## Basic Notation

Training data set: `$$\mathcal{D} = \{(y_i, x_i)\ |\ i \in [N]\}$$`

Response: `\(y_i \in \mathbb{R}\)`, e.g. the burnout rate or the insurance charges.

Predictors: `\(x_i \in \mathbb{R}^m\)`

.pull-left[
.right[
`\(\textbf{y} = \begin{pmatrix}y_1\\ \vdots \\ y_N \end{pmatrix}\)`
]
]

.pull-right[
.left[
`\(\textbf{x} = \begin{pmatrix}x_1\\ \vdots \\ x_N \end{pmatrix}\)`
]
]

---

## Previously

.pull-left[
### CART

<img src="data:image/png;base64,#img/tree_illustration.png" width="45%" style="display: block; margin: auto;" />

Let `\(\mathcal{T}\)` be the space of CART models.

`$$t(x_i, \gamma, R) = \sum_{j=1}^J \gamma_j I(x_i \in R_j) \quad t \in \mathcal{T}$$`

with `\(R_j\)` being distinct regions of the predictor space and `\(\gamma_j\)` the prediction w.r.t. the region.
]

--

.pull-right[
### Random forest

<img src="data:image/png;base64,#img/rf_illustration.png" width="70%" style="display: block; margin: auto;" />

Build many deep trees in parallel and average over the individual predictions. Randomness is induced to decorrelate the trees.
]

---

background-image: url("data:image/png;base64,#https://media.giphy.com/media/cRH5deQTgTMR2/giphy.gif")
background-size: cover
class: center, bottom, hide-logo

---

## The core idea of boosting

**Sequentially** build weak learners (here: shallow trees) that are ultimately combined in an additive fashion. Each additional learner should improve the overall model.
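This sequential residual-fitting idea can be sketched in a few lines. The following is a hypothetical minimal Python sketch (helper names are made up) with depth-1 trees, i.e. stumps, and squared-error loss, where each new learner is fit to the residuals of the current model:

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 tree: best threshold and the two leaf means."""
    best_sse, best = np.inf, None
    for thr in np.unique(x)[:-1]:
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (thr, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    thr, gamma_left, gamma_right = stump
    return np.where(x <= thr, gamma_left, gamma_right)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)

pred = np.full_like(y, y.mean())        # the initial constant model
mse = [np.mean((y - pred) ** 2)]
for k in range(50):                     # sequentially add weak learners
    stump = fit_stump(x, y - pred)      # fit the current residuals
    pred += predict_stump(stump, x)     # additive update of the total model
    mse.append(np.mean((y - pred) ** 2))

# the training error shrinks as weak learners are added
```

For squared-error loss the residuals are exactly the negative gradient of the loss, which is why this simple residual-fitting loop is already gradient boosting.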
<img src="data:image/png;base64,#img/boost_illustration.png" width="50%" style="display: block; margin: auto;" />

**The general additive model:**

`$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} t_k(x_i), \quad t_k \in \mathcal{T}$$`

---

## The fitting approach

To fit such a model one uses **Forward Stagewise Additive Modeling**.

* Given the current model `\(\phi_{k-1}(\textbf{x})\)`
* Add the tree `\(t_k\)` with regions `\(R^{(k)}\)` and leaf predictions `\(\gamma^{(k)}\)`
* The parameters of the current model are unchanged!

`$$\phi_k(\textbf{x}) = \phi_{k-1}(\textbf{x}) + t_k(\textbf{x}, \gamma^{(k)}, R^{(k)})$$`

Each additional tree should improve the overall model, i.e. solve the following optimization problem:

`$$(\gamma^{(k)},R^{(k)}) = \underset{\gamma,R}{\operatorname{argmin}} \sum_{i=1}^N L(y_i, \phi_{k-1}(x_i) + t(x_i,\gamma,R))$$`

where `\(L\)` is a convex and differentiable loss function.

---

## Connection with Gradient descent

**General minimization problem:**

`$$\textbf{s}_{opt} = \underset{\textbf{s}}{\operatorname{argmin}}\ f(\textbf{s}) \quad \textbf{s} \in \mathbb{R}^N$$`

where `\(f\)` is differentiable and `\(\lambda > 0\)` a scalar step size.

**Gradient descent in a nutshell:**

$$ \textbf{s}^{(k)} = \textbf{s}^{(k-1)} - \lambda \bigg[\frac{\partial f(\textbf{s})}{\partial \textbf{s}}\bigg]_{\textbf{s}=\textbf{s}^{(k-1)}} $$

For now set `\(\lambda = 1\)`.
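As a quick numerical illustration of the update rule above (the quadratic `f` is a made-up example, not from the talk), a few gradient steps drive `s` to the minimizer:

```python
import numpy as np

# f(s) = 0.5 * ||s - c||^2 has gradient (s - c) and minimizer s_opt = c
c = np.array([1.0, -2.0, 3.0])

def grad_f(s):
    return s - c

s = np.zeros(3)   # starting point s^(0)
lam = 0.5         # step size lambda
for _ in range(30):
    s = s - lam * grad_f(s)   # s^(k) = s^(k-1) - lambda * grad f(s^(k-1))

# s is now numerically indistinguishable from the minimizer c
```

With `\(\lambda = 1\)` this particular quadratic would be solved in a single step; a smaller step size is used here so the iteration is visible.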
---

## Connection with Gradient descent

.pull-left[
**General minimization setup:**

`$$\textbf{s}_{opt} = \underset{\textbf{s}}{\operatorname{argmin}}\ f(\textbf{s}) \quad \textbf{s} \in \mathbb{R}^N$$`

<br>

Gradient descent:

$$ \textbf{s}^{(k)} = \textbf{s}^{(k-1)} - \bigg[\frac{\partial f(\textbf{s})}{\partial \textbf{s}}\bigg]_{\textbf{s}=\textbf{s}^{(k-1)}} $$
]

.pull-right[
**The boosting setup:**

`$$\phi_{opt} = \underset{\phi}{\operatorname{argmin}} \sum_{i=1}^N L(y_i, \phi(x_i))$$`

<br>

`$$\phi_k(\textbf{x}) = \phi_{k-1}(\textbf{x}) + t_k(\textbf{x}, \gamma^{(k)}, R^{(k)})$$`
]

Corresponding parts:

* The loss function `\(L\)` to the function `\(f\)`
* The predictions `\(\phi_k(\textbf{x})\)` to the input `\(\textbf{s}^{(k)}\)`
* The predictions of the tree `\(t_k\)` to the negative gradient of `\(f\)`

---

## Tree-based gradient boosting

**Algorithm:**

**1.** Initialize `\(\phi_0(\textbf{x})\)` as a single-node tree, i.e. a constant prediction.

<br>

**Example:** Burnout data with the RMSE loss, fixed `\(K = 2\)` and tree depth 1.

<br>

<img src="data:image/png;base64,#img/grad_boost_example_init.png" width="95%" style="display: block; margin: auto;" />

---

## Tree-based gradient boosting

**2.** For each tree addition `\(k = 1\)` to `\(K\)` do:

- For each observation `\(i = 1\)` to `\(N\)` compute: `\(g^{(k)}_{i} = \bigg[\frac{\partial L(y_i, \phi(x_i))}{\partial\phi(x_i)}\bigg]_{\phi = \phi_{k-1}}\)`
- Fit a regression tree by least squares to the outcome vector `\(-\textbf{g}^{(k)}\)` in order to get the `\(J^{(k)}\)` distinct regions `\(\tilde{R}^{(k)}_j\)`.
- For each of these `\(J^{(k)}\)` regions perform a line search in order to compute the leaf predictions `\(\tilde{\gamma}^{(k)}_{j}\)`.
- Set `\(\phi_k(\textbf{x}) = \phi_{k-1}(\textbf{x}) + t(\textbf{x},\tilde{\gamma}^{(k)},\tilde{R}^{(k)})\)` with `\(t \in \mathcal{T}\)`

<!--
## Which loss to pick?

### `\(L_2\)` loss

➖ Sensitive to outliers.

➕ Simplifications possible as the gradient is the residuals.
.pull-left[
### Alternatives

* `\(L_1\)` loss
* Huber loss
]

.pull-right[
### Comparison

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/huber_loss.png" width="90%" style="display: block; margin: auto auto auto 0;" />
]
-->

---

## The example continued

<img src="data:image/png;base64,#img/grad_boost_example.png" width="95%" style="display: block; margin: auto;" />

---

## Combat Overfitting

### Early stopping

Monitor the loss on one or many (CV) validation sets and stop adding trees once the validation loss starts to increase.

--

### Shrinkage

Introduce a learning rate `\(\eta\)` that scales the contribution of each newly added tree.

`$$\phi_k(\textbf{x}) = \phi_{k-1}(\textbf{x}) + \eta \cdot t(\textbf{x},\tilde{\gamma}^{(k)},\tilde{R}^{(k)})$$`

--

### Subsampling

**Row-subsampling:** In each step use only a fraction of the training data.

**Column-subsampling:** In each step use only a fraction of the training features.

---

## An implementation: XGBoost

<br>
<center>
**A highly scalable end-to-end tree boosting system.**
</center>
<br>

* Strong regularization
  - Lasso/Ridge penalization on `\(\gamma\)`
  - Shrinkage ✔️
  - Subsampling ✔️
  - Early stopping ✔️
  - Maximum tree depth ✔️
* Parallelized tree building
* Extremely efficient on sparse data; handles missing values by learning a default direction in each split
* Cache-aware algorithm

Another popular implementation in R: the `gbm` package (largely the same functionality ✔️).

<!--
## XGBoost vs overfitting

### Regularized Loss

`$$\mathcal{L}(\phi) = L(\phi) + \sum_{k=1}^K \Omega_{\alpha, \lambda, \nu}(t_k)$$`

Where `\(\Omega_{\alpha, \lambda, \nu}\)` is the penalizing term for model complexity.

* `\(\alpha\)` is a `\(L_1\)` regularization parameter on `\(\gamma\)`.
* `\(\lambda\)` is a `\(L_2\)` regularization parameter on `\(\gamma\)`.
* `\(\nu\)` is the minimal loss reduction for terminal partitions.
* Second order approximations of the loss allow for better parallelization.

.pull-left[
### Shrinkage

### Subsampling
]

.pull-right[
### Early stopping

### Maximum tree depth
]
-->

---

background-image: url("data:image/png;base64,#https://media.giphy.com/media/BpGWitbFZflfSUYuZ9/giphy.gif")
background-size: cover
class: center, bottom, hide-logo

---

## Burnout data set

**Goal:** Predict the burnout rate of an employee.

.left-column[
<br>

**Number of predictors:** 9

**Training data:** 17301 rows

**Test data:** 4325 rows

**Missing:** 14%
]

.right-column[
<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/burn_rate_raw_nojitter.png" width="90%" style="display: block; margin: auto;" />
]

---

## Transformed outcome

Transforming the target with the empirical logit removes its bounds:

`$$\log\left(\frac{x+0.5}{1-x+0.5}\right)$$`

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/burn_rate_trans_nojitter.png" width="65%" style="display: block; margin: auto;" />

<!--
## Mental fatigue score

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/mfs_main.png" width="70%" style="display: block; margin: auto;" />

Proposed baseline model:

$$ \text{burn_rate}_i = \beta_0 + \beta_1 \text{ mental_fatigue_score}_i + \epsilon_i \quad \text{with } \epsilon_i \sim N(0,\sigma^2) $$
-->

<!--
## Two closely related features

<br>

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/designationVsRessources.png" width="80%" style="display: block; margin: auto;" />

Both features show relevant main effects.
-->

---

## Prepare for takeoff

### Pre-processing / Feature Engineering

* All variables are used without normalization (tree-based models do not require it).
* Extract the extra features `day of the week` and `month` from the date of joining.
* Dummy-encoding of all nominal variables.
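These steps, together with the empirical-logit transform of the target, can be sketched as follows. A hypothetical toy example in Python with pandas; the column names are illustrative, not the real schema:

```python
import numpy as np
import pandas as pd

# Toy data; column names are made up for illustration.
df = pd.DataFrame({
    "date_of_joining": pd.to_datetime(["2008-03-10", "2009-11-02", "2008-07-24"]),
    "gender": ["Male", "Female", "Female"],
    "burn_rate": [0.30, 0.60, 0.45],
})

# Extract the extra date features.
df["day_of_week"] = df["date_of_joining"].dt.day_name()
df["month"] = df["date_of_joining"].dt.month_name()

# Dummy-encode all nominal variables; no normalization is needed for trees.
df = pd.get_dummies(df, columns=["gender", "day_of_week", "month"])

# Empirical-logit transform of the bounded target.
df["burn_rate_trans"] = np.log((df["burn_rate"] + 0.5) / (1 - df["burn_rate"] + 0.5))
```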
--

### Resampling

**5-fold cross-validation** is used for model evaluation, e.g. for hyperparameter tuning.

<img src="data:image/png;base64,#img/cv5_illustration.png" width="70%" style="display: block; margin: auto;" />

---

## Tuning strategy

**1.** `#Trees`

Fix all other hyperparameters at their default values, choose a rather small learning rate, and determine the `\(K \in [100,2000]\)` that minimizes the validation loss.

**2.** `Tree and the basic regularization hyperparameters`

.pull-left[
`Learning rate`: `\(\eta \in [10^{-10},0.1]\)`

`Max tree depth`: `\(\max(J^{(k)}) \in [1,15]\)`

`Loss reduction`: `\(\nu \in [10^{-10},30]\)`
]

.pull-right[
`Column-subsampling`: `\([1,\text{#features}]\)`

`Row-subsampling`: `\([0.1,1]\)`
]

Evaluate a set of combinations of these (chosen via a space-filling design) and determine the combination that again minimizes the validation loss.

**3.** Refine the set of combinations and repeat step 2 to find the final combination.

<!-- .footnote[[*] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning (12th printing). Springer New York.] -->

---

## Tuning the number of trees

<br>

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/burn_boost_tune_first.png" width="80%" style="display: block; margin: auto auto auto 0;" />

👉 Fix 1500 trees.

---

## Space-filling grid search

Tuning both the model with the raw outcome and the one with the transformed outcome via space-filling grids came at a cost: **8 hours** of parallel computation.

<br>

**The final parameters:**

.pull-left[
`#Trees`: 1500

`Max tree depth`: 4

`Learning rate`: 0.02
]

.pull-right[
`Loss reduction`: roughly 0

`Column-subsampling`: not enabled

`Row-subsampling`: 80%
]

---

## Raw vs transformed target

The two models behaved essentially identically during tuning and show no notable differences in any of the following visualizations, so the focus will from now on be on the raw model.
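The combination search from the tuning strategy can be sketched roughly like this. Everything here is illustrative: scikit-learn's `GradientBoostingRegressor` serves as a stand-in for XGBoost, a synthetic data set replaces the real one, and random sampling replaces a true space-filling design:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterSampler, cross_val_score

# Synthetic regression data as a placeholder for the real training set.
X, y = make_friedman1(n_samples=300, n_features=8, noise=1.0, random_state=0)

# Candidate hyperparameter combinations (randomly sampled here; the talk
# uses a proper space-filling design instead).
space = {
    "learning_rate": np.logspace(-3, -1, 20),
    "max_depth": [1, 2, 3, 4],
    "subsample": np.linspace(0.5, 1.0, 6),
}
combos = list(ParameterSampler(space, n_iter=8, random_state=1))

# Evaluate each combination with 5-fold CV and keep the best one.
scores = [
    cross_val_score(
        GradientBoostingRegressor(n_estimators=100, random_state=0, **params),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for params in combos
]
best = combos[int(np.argmax(scores))]
```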
<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/bout_raw_vs_trans.png" width="70%" style="display: block; margin: auto auto auto 0;" />

---

## Feature importance

The importance of a feature is essentially its total gain, i.e. the loss reduction achieved by the feature, summed over all splits.

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/vip_burn.png" width="80%" style="display: block; margin: auto auto auto 0;" />

---

## Handling of missing values

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/miss_vals_burn2.png" width="75%" style="display: block; margin: auto auto auto 0;" />

The default directions of XGBoost seem to work well.

---

## Final performance

<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/final_perf_burn.png" width="80%" style="display: block; margin: auto auto auto 0;" />

---

## Insurance data set

**Goal:** Predict the medical costs (charges) billed by health insurance.

.left-column[
<br>

**Number of predictors:** 6

**Training data:** 1071 rows

**Test data:** 267 rows

**Missing:** None
]

.right-column[
<img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/charges_ins_nojitter.png" width="90%" style="display: block; margin: auto;" />
]

---

## Fitting and tuning

- Again all variables are used without normalization.
- Dummy-encoding of all nominal variables.
- The resampling setup with 5-fold CV is the same.
- The tuning strategy and ranges are unchanged.
- The tuning w.r.t. the number of trees resulted in a value of 600.
- The space-filling grid search over the remaining 5 hyperparameters took 30 minutes in total.
**The final parameters:** .pull-left[ `#Trees`: 600 `Max tree depth`: 3 `Learning rate`: 0.02 ] .pull-right[ `Loss reduction`: 0.03 `Column-subsampling`: 7 of 8 `Row-subsampling`: 80% ] -- <!-- <br> .center[ <img src="data:image/png;base64,#https://media.giphy.com/media/ZNHlhLQb1pm5FYXoWy/giphy.gif" width="300" height="300" /> ] --> <!-- ## Tuning the number of trees <img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/boost_ins_tune_plot1.png" width="90%" style="display: block; margin: auto;" /> So a closer look at the region around 500 is advisable. ## Tuning the number of trees <img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/boost_ins_tune_plot2.png" width="90%" style="display: block; margin: auto;" /> 👉 Fix 600 trees. --> <!-- ## A glimpse at learning rates <img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/boost_ins_tune_lrate.png" width="90%" style="display: block; margin: auto;" /> Thus a too small learning rate can also be a problem. --> <!-- ## Space-filling grid search - 600 combinations. - 900k trees. - 30 minutes of parallel computation. <br> **The final parameters:** .pull-left[ `#Trees`: 600 `Max tree depth`: 3 `Learning rate`: 0.02 ] .pull-right[ `Loss reduction`: 0.03 `Column-subsampling`: 7 of 8 `Row-subsampling`: 80% ] --> <!-- `#Trees`: 600 `Max tree depth`: 3 `Learning rate`: 0.02 `Loss reduction`: 0.03 `Column-subsampling`: 7 of 8 `Row-subsampling`: 80% --> --- ## Feature importance <img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/vip_ins.png" width="90%" style="display: block; margin: auto auto auto 0;" /> --- ## Final Performance <img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. 
MA Semester/Seminar/boosting_methods/_pictures/final_perf_ins.png" width="80%" style="display: block; margin: auto auto auto 0;" />

---

## Pros

- **Minimal pre-processing**
- **Flexible enough to detect complex non-linear patterns**
- **Handling of missing values**
- **Integrated feature selection**
- **Good generalization due to the many regularization options**
- **Strong predictive power**

## Cons

- **Not easily explainable**
- **Computationally demanding (especially the hyperparameter tuning)**
- **Needs many observations to perform at its best**

---

class: inverse, center, middle, hide-logo

## Thank you and happy boosting!

Check out the bookdown project for more!

---

class: inverse, center, middle, hide-logo

## Discussion

---

## Interesting residuals
--- ## R setup .pull-left[ <img src="data:image/png;base64,#img/tidyverse_hex.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="data:image/png;base64,#img/tidymodels_hex.png" width="80%" style="display: block; margin: auto;" /> ] <br> Amazing book: **Tidy modeling with R** --- ## Compressed tree visualization <img src="data:image/png;base64,#C:/Users/Emanuel/Documents/Uni/3. MA Semester/Seminar/boosting_methods/_pictures/ins_multitree_xgb.png" width="90%" style="display: block; margin: auto auto auto 0;" />